- Software-RAID mini-HOWTO
- Linas Vepstas, linas@linas.org
- v0.38 28 December 1997
-
- RAID stands for ''Redundant Array of Inexpensive Disks'', and is meant
- to be a way of creating a fast and reliable disk-drive subsystem out
- of individual disks. This document is a tutorial/HOWTO/FAQ for users
- of the Linux MD kernel extension, the associated tools, and their use.
- The MD extension implements RAID-0 (striping), RAID-1 (mirroring),
- RAID-4 and RAID-5 in software. That is, with MD, no special hardware
- or disk controllers are required to get many of the benefits of RAID.
- This document is NOT an introduction to RAID; you must find this
- elsewhere.
-
- Preamble
- This document is GPL'ed by Linas Vepstas (linas@linas.org).
- Permission to use, copy, distribute this document for any
- purpose is hereby granted, provided that the author's / editor's
- name and this notice appear in all copies and/or supporting
- documents; and that an unmodified version of this document is
- made freely available. This document is distributed in the hope
- that it will be useful, but WITHOUT ANY WARRANTY, either
- expressed or implied. While every effort has been taken to
- ensure the accuracy of the information documented herein, the
- author / editor / maintainer assumes NO RESPONSIBILITY for any
- errors, or for any damages, direct or consequential, as a result
- of the use of the information documented herein.
-
- RAID, although designed to improve system reliability by adding
- redundancy, can also lead to a false sense of security and
- confidence when used improperly. This false confidence can lead
- to even greater disasters. In particular, note that RAID is
- designed to protect against *disk* failures, and not against
- *power* failures. A power failure can damage data on the disks
- in such a way that it is not recoverable! RAID is *not* a
- substitute for proper backup of your system. Know what you are
- doing, test, be knowledgeable and aware!
-
- 1. Introduction
-
- 1. Q: What is RAID?
-
- A: RAID stands for "Redundant Array of Inexpensive Disks",
- and is meant to be a way of creating a fast and reliable
- disk-drive subsystem out of individual disks.
-
- 2. Q: What is this document?
-
- A: This document is a tutorial/HOWTO/FAQ for users of the
- Linux MD kernel extension, the associated tools, and their
- use. The MD extension implements RAID-0 (striping), RAID-1
- (mirroring), RAID-4 and RAID-5 in software. That is, with
- MD, no special hardware or disk controllers are required to
- get many of the benefits of RAID.
-
- This document is NOT an introduction to RAID; you must find
- this elsewhere.
-
- 3. Q: What levels of RAID does the Linux kernel implement?
-
- A: Striping (RAID-0) and linear concatenation are a part of
- the stock 2.x series of kernels. This code is of production
- quality; it is well understood and well maintained. It is
- being used in some very large USENET news servers.
-
- RAID-1, RAID-4 & RAID-5 are not present in the stock kernel;
- a separate patch needs to be applied to get this functionality.
- The current snapshots should be considered beta quality; that
- is, there are no known bugs but there are some rough edges and
- untested system setups.
-
- RAID-1 hot reconstruction has been recently introduced
- (August 1997) and should be considered alpha quality.
- RAID-5 hot reconstruction will be alpha quality any day now.
-
- 4. Q: Where do I get it?
-
- A: Software RAID-0 and linear mode are a stock part of all
- current Linux kernels. Patches for Software RAID-1,4,5 are
- available from
- <http://luthien.nuclecu.unam.mx/~miguel/raid>. See also the
- quasi-mirror
- <ftp://linux.kernel.org/pub/linux/daemons/raid/> for
- patches, tools and other goodies.
-
- 5. Q: Are there other Linux RAID references?
-
- A:
-
- ╖ Generic RAID overview:
- <http://www.dpt.com/uraiddoc.html>.
-
- ╖ General Linux RAID options:
- <http://linas.org/linux/raid.html>.
-
- ╖ Linux-RAID mailing list archive:
- <http://www.linuxhq.com/lnxlists>.
-
- ╖ Linux Software RAID Home Page:
- <http://luthien.nuclecu.unam.mx/~miguel/raid>.
-
- ╖ Linux Software RAID tools:
- <ftp://linux.kernel.org/pub/linux/daemons/raid/>.
-
- ╖ How to set up linear/striped Software RAID:
- <http://www.ssc.com/lg/issue17/raid.html>.
-
- ╖ Bootable RAID mini-HOWTO:
- <ftp://ftp.bizsystems.com/pub/raid/bootable-raid>.
-
- ╖ Linux RAID-Geschichten (Linux RAID stories, in German):
- <http://www.infodrom.north.de/~joey/Linux/raid/>.
-
- 6. Q: Who do I blame for this document?
-
- A: Linas Vepstas slapped this thing together. However, most
- of the information, and some of the words were supplied by
- ╖ Bradley Ward Allen <ulmo@Q.Net>
-
- ╖ Luca Berra <bluca@comedia.it>
-
- ╖ Brian Candler <B.Candler@pobox.com>
-
- ╖ Bohumil Chalupa <bochal@apollo.karlov.mff.cuni.cz>
-
- ╖ Anton Hristozov <anton@intransco.com>
-
- ╖ Miguel de Icaza <miguel@luthien.nuclecu.unam.mx>
-
- ╖ Ingo Molnar <mingo@pc7537.hil.siemens.at>
-
- ╖ Alvin Oga <alvin@planet.fef.com>
-
- ╖ Gadi Oxman <gadio@netvision.net.il>
-
- ╖ Michael Robinton <michael@bzs.org>
-
- ╖ Martin Schulze <joey@finlandia.infodrom.north.de>
-
- ╖ Geoff Thompson <geofft@cs.waikato.ac.nz>
-
- ╖ Edward Welbon <welbon@bga.com>
-
- ╖ Rod Wilkens <rwilkens@border.net>
-
- ╖ Johan Wiltink <j.m.wiltink@pi.net>
-
- ╖ Leonard N. Zubkoff <lnz@dandelion.com>
-
- ╖ Marc ZYNGIER <zyngier@ufr-info-p7.ibp.fr>
-
- Copyrights
-
- ╖ Copyright (C) 1994-96 Marc ZYNGIER
-
- ╖ Copyright (C) 1997 Gadi Oxman, Ingo Molnar, Miguel de
- Icaza
-
- ╖ Copyright (C) 1997 Linas Vepstas
-
- ╖ By copyright law, additional copyrights are implicitly
- held by the contributors listed above.
-
- Thanks all for being there!
-
- 2. Setup & Installation Considerations
-
- 1. Q: I must soon install Linux on a new system; one requirement is
- to have RAID-1. Now I'm wondering what is the easiest way to do it.
-
- A: I keep rediscovering that file-system planning is one of
- the more difficult Unix configuration tasks. To answer your
- question, I can describe what we did.
-
- We planned the following setup:
-
- ╖ two EIDE disks, 2.1 GB each.
-
- disk partition mount pt. size device
- 1 1 / 300M /dev/hda1
- 1 2 swap 64M /dev/hda2
- 1 3 /home 800M /dev/hda3
- 1 4 /var 900M /dev/hda4
-
- 2 1 /root 300M /dev/hdc1
- 2 2 swap 64M /dev/hdc2
- 2 3 /home 800M /dev/hdc3
- 2 4 /var 900M /dev/hdc4
-
- ╖ each disk is on a separate controller (& ribbon cable).
- The theory is that a controller failure and/or ribbon
- failure won't disable both disks. We might also get a
- performance boost from parallel operations.
-
- ╖ Install Linux with / on /dev/hda1; this will allow booting
- and subsequent installation of the raid patches, etc.
-
- ╖ /dev/hdc1 will contain a ``cold'' copy of /dev/hda1. This
- is NOT a raid copy, just a copy-copy. It's there just in
- case disk1 fails completely; we can use a rescue disk,
- mark /dev/hdc1 as bootable, and use that to keep going,
- without having to reinstall the system.
-
- The theory here is that in case of severe failure, I can
- still boot the system without worrying about raid
- superblock-corruption or other raid failure modes &
- gotchas that I don't understand.
-
- ╖ /dev/hda3 and /dev/hdc3 will be mirrored as /dev/md0.
-
- ╖ /dev/hda4 and /dev/hdc4 will be mirrored as /dev/md1.
-
- ╖ we picked /var and /home to be mirrored, and in separate
- partitions, under the following (convoluted ???) logic:
-
- ╖ / will contain non-changing data --- for all practical
- purposes, it will be read-only without actually being
- read-only.
-
- ╖ /home will contain slowly changing data --- an almost-
- read-only system.
-
- ╖ /var will contain rapidly changing data, including mail
- spools, database contents and web server logs.
-
- The theory is that if for some bizarre reason, the
- operating system goes wild, corruption is limited to one
- partition. Thus, if for some unlikely, hypothetical
- reason, the database starts scribbling everywhere, it
- might clobber mail and log files, but not /home.
-
- I am not entirely satisfied with my logic & reasoning,
- but it was the best I could do on short notice. I would
- like to have some scheme that verifies that files in /usr
- and /home are not changed, e.g. some MD5 signature scheme
- that is run regularly. The idea is to detect hacker
- intrusion as well as corruption. Similarly, the database
- contents are quite valuable, and I don't have a fault-
- tolerant plan for them that will let me sleep well at
- night.
- So, to complete the answer to your question:
-
- ╖ install Red Hat on disk 1, partition 1; do NOT mount any
- of the other partitions.
-
- ╖ install raid per instructions.
-
- ╖ configure md0 and md1.
-
- ╖ convince yourself that you know what to do in case of a
- disk failure! Discover sysadmin mistakes now, and not
- during an actual crisis. Experiment! (we turned off
- power during disk activity --- this proved to be ugly but
- informative).
-
- ╖ do some ugly mount/copy/unmount/rename/reboot scheme to
- move /var over to /dev/md1 (a sketch follows this list).
- Done carefully, this is not dangerous.
-
- ╖ enjoy!
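-
- As an illustration of the mount/copy step above: a minimal
- sketch of one way to move /var onto /dev/md1, ideally done in
- single-user mode. The mount point /mnt/newvar and the use of
- cp -a are assumptions; any equivalent copy (tar, cpio) will do,
- and /dev/md1 is assumed to be already mkraid'ed, mdadd'ed and
- running:
-
- mke2fs /dev/md1               # create a file system on the mirror
- mkdir /mnt/newvar             # temporary mount point (hypothetical name)
- mount /dev/md1 /mnt/newvar
- cp -a /var/. /mnt/newvar/     # copy everything, preserving ownership/permissions
- mv /var /var.old              # keep the old copy until you are sure
- mkdir /var
- umount /mnt/newvar
- mount /dev/md1 /var           # and add the matching line to /etc/fstab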
-
- 2. Q: Can I stripe/mirror the root partition (/)? Why can't I boot
- Linux directly from the md disks?
-
- A: Both Lilo and Loadlin need a non-striped, non-mirrored
- partition to read the kernel image from. If you want to
- stripe/mirror the root partition (/), then create a separate
- unstriped, unmirrored partition to hold the kernel.
- Typically, this is /boot. Then you either use the initial
- ramdisk support, or some old patches that were posted a while
- back, to allow your root device to be striped.
-
- There are several approaches that can be used. One approach
- is documented in detail in the Bootable RAID mini-HOWTO:
- <ftp://ftp.bizsystems.com/pub/raid/bootable-raid>.
-
- Alternately, use mkinitrd to build the ramdisk image, see
- below.
-
- Edward Welbon <welbon@bga.com> writes:
-
- ╖ To mount an md filesystem as root, the main thing is to
- build an initial file system image that has the needed
- modules and md tools to start md. I have a simple script
- that does this.
-
- ╖ For boot media, I have a small, cheap SCSI disk (170MB; I
- got it used for $20). This disk runs on an AHA1452, but
- it could just as well be an inexpensive IDE disk on the
- native IDE interface. The disk need not be very fast
- since it is mainly for boot.
-
- ╖ This disk has a small file system which contains the
- kernel and the file system image for initrd. The initial
- file system image has just enough stuff to allow me to
- load the raid SCSI device driver module and start the
- raid partition that will become root. I then do an
-
- echo 0x900 > /proc/sys/kernel/real-root-dev
-
- (0x900 is for /dev/md0) and exit linuxrc. The boot proceeds
- normally from there.
-
- ╖ I have built most support as a module except for the
- AHA1452 driver that brings in the initrd filesystem. So
- I have a fairly small kernel. The method is perfectly
- reliable; I have been doing this since before 2.1.26 and
- have never had a problem that I could not easily recover
- from. The file systems even survived several 2.1.4[45]
- hard crashes with no real problems.
-
- ╖ At one time I had partitioned the raid disks so that the
- initial cylinders of the first raid disk held the kernel
- and the initial cylinders of the second raid disk held
- the initial file system image. Instead, I now make the
- initial cylinders of the raid disks swap, since they are
- the fastest cylinders (why waste them on boot?).
-
- ╖ The nice thing about having an inexpensive device
- dedicated to boot is that it is easy to boot from and can
- also serve as a rescue disk if necessary. If you are
- interested, you can take a look at the script that builds
- my initial ram disk image and then runs lilo.
-
- <http://www.realtime.net/~welbon/initrd.md.tar.gz>
-
- It is current enough to show the picture. It isn't
- especially pretty, and it could certainly build a much
- smaller filesystem image for the initial ram disk. It
- would be easy to make it more efficient. But it uses lilo
- as is. If you make any improvements, please forward a copy
- to me. 8-)
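-
- For readers who want to try this, here is a minimal, hypothetical
- linuxrc along the lines Edward describes. The module names, paths
- and member partitions are assumptions; only the echo into
- /proc/sys/kernel/real-root-dev is taken directly from the text
- above:
-
- #!/bin/sh
- # linuxrc: runs from the initial ramdisk, before the real root is mounted
- /sbin/insmod /modules/scsi-driver.o        # hypothetical driver for the raid disks
- /sbin/insmod /modules/raid1.o              # RAID-1 personality, if built as a module
- /sbin/mdadd /dev/md0 /dev/sda3 /dev/sdb3   # assumed member partitions
- /sbin/mdrun -p1 /dev/md0
- echo 0x900 > /proc/sys/kernel/real-root-dev   # 0x900 == /dev/md0
- # exiting linuxrc lets the kernel mount /dev/md0 as the root file system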
-
- 3. Q: I have heard that I can run mirroring over striping. Is this
- true? Can I run mirroring over the loopback device?
-
- A: Yes, but not the reverse. That is, you can put a stripe
- over several disks, and then build a mirror on top of this.
- However, striping cannot be put on top of mirroring.
-
- A brief technical explanation is that the linear and stripe
- personalities use the ll_rw_blk routine for access. The
- ll_rw_blk routine maps disk devices and sectors, not
- blocks. Block devices can be layered one on top of the
- other; but devices that do raw, low-level disk accesses,
- such as ll_rw_blk, cannot.
-
- Currently (November 1997) RAID cannot be run over the
- loopback devices, although this should be fixed shortly.
-
- 4. Q: I have two small disks and three larger disks. Can I
- concatenate the two smaller disks with RAID-0, and then create a
- RAID-5 out of that and the larger disks?
- A: Currently (November 1997), no, not for a RAID-5 array.
- One can do this only for a RAID-1 on top of the
- concatenated drives.
-
- 5. Q: What is the difference between RAID-1 and RAID-5 for a two-disk
- configuration (i.e. the difference between a RAID-1 array built
- out of two disks, and a RAID-5 array built out of two disks)?
-
- A: There is no difference in storage capacity. Nor can
- disks be added to either array to increase capacity (see the
- question below for details).
-
- RAID-1 offers a performance advantage for reads: the RAID-1
- driver uses distributed-read technology to simultaneously
- read two sectors, one from each drive, thus doubling read
- performance.
-
- The RAID-5 driver, although it contains many optimizations,
- does not currently (September 1997) realize that the parity
- disk is actually a mirrored copy of the data disk. Thus, it
- serializes data reads.
-
- 6. Q: How can I guard against a two-disk failure?
-
- A: Some of the RAID algorithms do guard against multiple
- disk failures, but these are not currently implemented for
- Linux. However, the Linux Software RAID can guard against
- multiple disk failures by layering an array on top of an
- array. For example, nine disks can be used to create three
- raid-5 arrays. Then these three arrays can in turn be
- hooked together into a single RAID-5 array on top. In fact,
- this kind of a configuration will guard against a three-disk
- failure. Note that a large amount of disk space is
- ''wasted'' on the redundancy information.
-
- For an NxN raid-5 array,
- N=3, 5 out of 9 disks are used for parity (=~55%)
- N=4, 7 out of 16 disks
- N=5, 9 out of 25 disks
- ...
- N=9, 17 out of 81 disks (=~21%)
-
- In general, an MxN array will use M+N-1 disks for parity.
- The least amount of space is "wasted" when M=N.
-
- Another alternative is to create a RAID-1 array with three
- disks. Note that since all three disks contain identical
- data, 2/3 of the space is ''wasted''.
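-
- A hypothetical sketch of the nine-disk example, using the mdtools
- commands shown elsewhere in this document. The device names are
- assumptions, and the config files for each array would have to be
- written and mkraid'ed first:
-
- # three RAID-5 arrays of three disks each
- mdadd /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 ; mdrun -p5 /dev/md0
- mdadd /dev/md1 /dev/sdd1 /dev/sde1 /dev/sdf1 ; mdrun -p5 /dev/md1
- mdadd /dev/md2 /dev/sdg1 /dev/sdh1 /dev/sdi1 ; mdrun -p5 /dev/md2
- # ... and a RAID-5 array layered on top of the three arrays
- mdadd /dev/md3 /dev/md0 /dev/md1 /dev/md2 ; mdrun -p5 /dev/md3
-
- Each bottom array loses one disk to parity, and the top array
- loses the capacity of one whole bottom array, which matches the
- 5-out-of-9 figure in the table above.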
-
- 7. Q: I'd like to understand how it'd be possible to have something
- like fsck: if the partition hasn't been cleanly unmounted, fsck
- runs and fixes the filesystem by itself more than 90% of
- the time. Since the machine is capable of fixing it by itself with
- ckraid --fix, why not make it automatic?
-
- A: Brian Candler <B.Candler@pobox.com> responds:
-
- Then you just put ckraid into your system initialization
- scripts, like fsck is. After the root partition is mounted,
- add the following to /etc/rc.d/rc.sysinit:
-
- mdadd /dev/md0 /dev/hda1 /dev/hdc1 || {
- ckraid --fix /etc/raid.usr.conf
- mdadd /dev/md0 /dev/hda1 /dev/hdc1
- }
- mdadd /dev/md1 /dev/hda2 /dev/hdc2 || {
- ckraid --fix /etc/raid.var.conf
- mdadd /dev/md1 /dev/hda2 /dev/hdc2
- }
-
- (Modify the above to suit your system.)
-
- Gadi Oxman explains the operation: In an unclean shutdown,
- Linux might be in one of the following states:
-
- 1. The in-memory disk cache was in sync with the RAID set
- when the unclean shutdown occurred; no data was lost.
-
- 2. The in-memory disk cache was newer than the RAID set
- contents when the crash occurred; this results in a
- corrupted filesystem and potentially in data loss.
-
- This state can be further divided into the following two
- states:
-
- a. Linux was writing data when the unclean shutdown
- occurred.
-
- b. Linux was not writing data when the crash occurred.
-
- Suppose we were using a RAID-1 array. In (2a), it might
- happen that before the crash, a small number of data
- blocks were successfully written only to some of the
- mirrors, so that on the next reboot, the mirrors will no
- longer contain the same data.
-
- If we ignore the mirror differences, the 0.36.3 read-
- balancing code might choose to read the above data blocks
- from any of the mirrors, which will result in
- inconsistent behavior (for example, the output of e2fsck
- -n /dev/md0 can differ from run to run).
-
- Since RAID doesn't protect against unclean shutdowns,
- usually there isn't any ''obviously correct'' way to fix
- the mirror differences and the filesystem corruption.
-
- For example, by default ckraid --fix will choose the
- first operational mirror and update the other mirrors
- with its contents.
-
- However, depending on the exact timing at the crash, the
- data on another mirror might be more recent, and we might
- want to use it as the source mirror instead, or perhaps
- use another method for recovery.
-
- If you wish to run ckraid --fix automatically, you can
- check the return code of mdrun for errors. For example:
-
- mdrun -p1 /dev/md0
- if [ $? -gt 0 ] ; then
- ckraid --fix /etc/raid1.conf
- mdrun -p1 /dev/md0
- fi
-
- 3. Error Recovery
-
- 1. Q: I have a RAID-1 (mirroring) setup, and lost power while there
- was disk activity. Now what do I do?
-
- A: The redundancy of RAID levels is designed to protect
- against a disk failure, not against a power failure.
-
- There are several ways to recover from this situation.
-
- ╖ Method (1): Use the raid tools. These can be used to
- sync the raid arrays. They do not fix file-system
- damage; after the raid arrays are sync'ed, the file
- system still has to be fixed with fsck. Raid arrays can
- be checked with ckraid /etc/raid1.conf (for RAID-1; for
- RAID-5, use /etc/raid5.conf, and so on).
-
- Calling ckraid /etc/raid1.conf --fix will pick one of the
- disks in the array (usually the first), and use that as
- the master copy, and copy its blocks to the others in the
- mirror.
-
- To designate which of the disks should be used as the
- master, you can use the --force-source flag: for example,
- ckraid /etc/raid1.conf --fix --force-source /dev/hdc3
-
- The ckraid command can be safely run without the --fix
- option to verify the inactive RAID array without making
- any changes. When you are comfortable with the proposed
- changes, supply the --fix option.
-
- ╖ Method (2): Paranoid, time-consuming, not much better
- than the first way. Let's assume a two-disk RAID-1 array,
- consisting of partitions /dev/hda3 and /dev/hdc3. You
- can try the following:
-
- a. fsck /dev/hda3
-
- b. fsck /dev/hdc3
-
- c. decide which of the two partitions had fewer errors,
- was more easily recovered, or recovered the data that
- you wanted. Pick one, either one, to be your new
- ``master'' copy. Say you picked /dev/hdc3.
-
- d. dd if=/dev/hdc3 of=/dev/hda3
-
- e. mkraid raid1.conf -f --only-superblock
-
- Instead of the last two steps, you can run ckraid
- /etc/raid1.conf --fix --force-source /dev/hdc3, which
- should be a bit faster.
-
- ╖ Method (3): Lazy man's version of above. If you don't
- want to wait for long fsck's to complete, it is perfectly
- fine to skip the first three steps above, and move
- directly to the last two steps. Just be sure to run fsck
- /dev/md0 after you are done. Method (3) is actually just
- method (1) in disguise.
-
- In any case, the above steps will only sync up the raid
- arrays. The file system probably needs fixing as well:
- for this, fsck needs to be run on the active, unmounted
- md device.
-
- With a three-disk RAID-1 array, there are more
- possibilities, such as using two disks to ''vote'' a
- majority answer. Tools to automate this do not currently
- (September 97) exist.
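-
- A short sketch of the ``verify first, fix later'' workflow
- described in Method (1), assuming a two-disk RAID-1 described by
- /etc/raid1.conf:
-
- ckraid /etc/raid1.conf                 # inspect only; no changes are made
- ckraid /etc/raid1.conf --fix           # let ckraid pick the master copy
- # or, to choose the master yourself:
- # ckraid /etc/raid1.conf --fix --force-source /dev/hdc3
- mdadd /dev/md0 /dev/hda3 /dev/hdc3
- mdrun -p1 /dev/md0
- fsck /dev/md0                          # the file system still needs checking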
-
- 2. Q: I have a RAID-4 or a RAID-5 (parity) setup, and lost power while
- there was disk activity. Now what do I do?
-
- A: The redundancy of RAID levels is designed to protect
- against a disk failure, not against a power failure.
-
- Since the disks in a RAID-4 or RAID-5 array do not contain a
- file system that fsck can read, there are fewer repair
- options. You cannot use fsck to do preliminary checking
- and/or repair; you must use ckraid first.
-
- The ckraid command can be safely run without the --fix
- option to verify the inactive RAID array without making any
- changes. When you are comfortable with the proposed
- changes, supply the --fix option.
-
- If you wish, you can try designating one of the disks as a
- ''failed disk''. Do this with the --suggest-failed-disk-
- mask flag. Only one bit should be set in the flag: RAID-5
- cannot recover two failed disks. The mask is a binary bit
- mask: thus:
-
- 0x1 == first disk
- 0x2 == second disk
- 0x4 == third disk
- 0x8 == fourth disk, etc.
-
- Alternately, you can choose to modify the parity sectors, by
- using the --suggest-fix-parity flag. This will recompute
- the parity from the other sectors.
-
- The flags --suggest-failed-disk-mask and --suggest-fix-parity
- can be safely used for verification. No changes are made if
- the --fix flag is not specified. Thus, you can experiment
- with different possible repair schemes.
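-
- For example, a hedged sketch of how these flags might be used on
- a RAID-5 set described by /etc/raid5.conf, first in verify-only
- mode and then for real (the choice of 0x2, the second disk, is
- just an illustration):
-
- ckraid /etc/raid5.conf --suggest-failed-disk-mask 0x2        # verify only
- ckraid /etc/raid5.conf --suggest-failed-disk-mask 0x2 --fix  # rebuild disk 2
- # or, alternately, recompute the parity sectors:
- ckraid /etc/raid5.conf --suggest-fix-parity                  # verify only
- ckraid /etc/raid5.conf --suggest-fix-parity --fix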
-
- 3. Q: My RAID-1 device, /dev/md0 consists of two hard drive
- partitions: /dev/hda3 and /dev/hdc3. Recently, the disk with
- /dev/hdc3 failed, and was replaced with a new disk. My best
- friend, who doesn't understand RAID, said that the correct thing to
- do now is to ''dd if=/dev/hda3 of=/dev/hdc3''. I tried this, but
- things still don't work.
-
- A: You should keep your best friend away from your computer.
- Fortunately, no serious damage has been done. You can
- recover from this by running:
-
- mkraid raid1.conf -f --only-superblock
-
- By using dd, two identical copies of the partition were
- created. This is almost correct, except that the RAID-1
- kernel extension expects the RAID superblocks to be
- different. Thus, when you try to reactivate RAID, the
- software will notice the problem, and deactivate one of
- the two partitions. By re-creating the superblock, you
- should have a fully usable system.
-
- 4. Q: My RAID-1 device, /dev/md0 consists of two hard drive
- partitions: /dev/hda3 and /dev/hdc3. My best (girl?)friend, who
- doesn't understand RAID, ran fsck on /dev/hda3 while I wasn't
- looking, and now the RAID won't work. What should I do?
-
- A: You should re-examine your concept of ``best friend''.
- In general, fsck should never be run on the individual
- partitions that compose a RAID array. Assuming that neither
- of the partitions is/was heavily damaged, no data loss has
- occurred, and the RAID-1 device can be recovered as follows:
-
- a. make a backup of the file system on /dev/hda3
-
- b. dd if=/dev/hda3 of=/dev/hdc3
-
- c. mkraid raid1.conf -f --only-superblock
-
- This should leave you with a working disk mirror.
-
- 5. Q: Why does the above work as a recovery procedure?
-
- A: Because each of the component partitions in a RAID-1
- mirror is a perfectly valid copy of the file system. In a
- pinch, mirroring can be disabled, and one of the partitions
- can be mounted and safely run as an ordinary, non-RAID file
- system. When you are ready to restart using RAID-1, then
- unmount the partition, and follow the above instructions to
- restore the mirror. Note that the above works ONLY for
- RAID-1, and not for any of the other levels.
-
- It may make you feel more comfortable to reverse the
- direction of the copy above: copy from the disk that was
- untouched to the one that was touched. Just be sure to
- fsck the final md device.
-
- 6. Q: I am confused by the above questions, but am not yet bailing
- out. Is it safe to run fsck /dev/md0 ?
-
- A: Yes, it is safe to run fsck on the md devices. In fact,
- this is the only safe place to run fsck.
-
- 7. Q: If a disk is slowly failing, will it be obvious which one it is?
- I am concerned that it won't be, and this confusion could lead to
- some dangerous decisions by a sysadmin.
-
- A: Once a disk fails, an error code will be returned from
- the low level driver to the RAID driver. The RAID driver
- will mark it as ``bad'' in the RAID superblocks of the
- ``good'' disks (so we will later know which mirrors are good
- and which aren't), and continue RAID operation on the
- remaining operational mirrors.
-
- This, of course, assumes that the disk and the low level
- driver can detect a read/write error, and will not silently
- corrupt data, for example. This is true of current drives
- (error detection schemes are being used internally), and is
- the basis of RAID operation.
-
- 8. Q: What about hot-repair?
-
- A: Work is underway to complete ``hot reconstruction''.
- With this feature, one can add several ``spare'' disks to
- the RAID set (be it level 1 or 4/5), and once a disk fails,
- it will be reconstructed on one of the spare disks at run
- time, without ever needing to shut down the array.
-
- However, to use this feature, the spare disk must have been
- declared at boot time, or it must be hot-added, which
- requires the use of special cabinets and connectors that
- allow a disk to be added while the electrical power is on.
-
- As of October 97, there is a beta version of MD that allows:
-
- ╖ RAID 1 and 5 reconstruction on spare drives
-
- ╖ RAID-5 parity reconstruction after an unclean shutdown
-
- ╖ spare disk to be hot-added to an already running RAID 1
- or 4/5 array
-
- Automatic reconstruction is currently (Dec 97) disabled
- by default, due to the preliminary nature of this work.
- It can be enabled by changing the value of
- SUPPORT_RECONSTRUCTION in include/linux/md.h.
-
- If spare drives were configured into the array when it
- was created and kernel-based reconstruction is enabled,
- the spare drive will already contain a RAID superblock
- (written by mkraid), and the kernel will reconstruct its
- contents automatically (without needing the usual mdstop,
- replace drive, ckraid, mdrun steps).
-
- If you are not running automatic reconstruction, and have
- not configured a hot-spare disk, the procedure described
- by Gadi Oxman <gadio@netvision.net.il> is recommended:
-
- ╖ Currently, once the first disk is removed, the RAID set
- will be running in degraded mode. To restore full
- operation mode, you need to:
-
- ╖ stop the array (mdstop /dev/md0)
-
- ╖ replace the failed drive
-
- ╖ run ckraid raid.conf to reconstruct its contents
-
- ╖ run the array again (mdadd, mdrun).
-
- At this point, the array will be running with all the
- drives, and again protects against a failure of a single
- drive.
-
- Currently, it is not possible to assign a single hot-spare
- disk to several arrays. Each array requires its own
- hot-spare.
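-
- As a consolidated sketch, the manual replacement procedure above
- might look like this for a two-disk RAID-1 (the device names,
- config file and use of --fix are assumptions):
-
- mdstop /dev/md0                      # stop the degraded array
- # ... physically replace the failed drive and repartition it identically ...
- ckraid --fix /etc/raid1.conf         # reconstruct the new drive's contents
- mdadd /dev/md0 /dev/hda3 /dev/hdc3   # re-assemble
- mdrun -p1 /dev/md0                   # and run the array again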
-
- 9. Q: I would like to have an audible alarm for ``you schmuck, one
- disk in the mirror is down'', so that the novice sysadmin knows
- that there is a problem.
-
- A: The kernel logs the event with a ``KERN_ALERT''
- priority in syslog. There are several software packages
- that will monitor the syslog files, beep the PC speaker,
- call a pager, send e-mail, etc. automatically.
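-
- No particular package is named here, but a crude watcher is easy
- to improvise. A hypothetical sketch, assuming syslogd writes
- kernel messages to /var/log/messages and that the relevant
- messages mention the md device (check your own logs for the
- exact text); run it periodically from cron:
-
- #!/bin/sh
- # mail root if anything md-related and alarming shows up in the log
- if egrep -i 'md.*(error|fail)' /var/log/messages > /tmp/md-alert.$$ ; then
-     mail -s "possible RAID trouble" root < /tmp/md-alert.$$
- fi
- rm -f /tmp/md-alert.$$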
-
- 10.
- Q: How do I run RAID-5 in degraded mode (with one disk failed, and
- not yet replaced)?
-
- A: Gadi Oxman <gadio@netvision.net.il> writes:
-
- ╖ Normally, to run a RAID-5 set of n drives you have to:
-
- mdadd /dev/md0 /dev/disk1 ... /dev/disk(n)
- mdrun -p5 /dev/md0
-
- Even if one of the disks has failed, you still have to mdadd
- it as you would in a normal setup. Then,
-
- ╖ The array will be active in degraded mode with (n - 1)
- drives. If ``mdrun'' fails, the kernel has noticed an
- error (for example, several faulty drives, or an unclean
- shutdown). Use ``dmesg'' to display the kernel error
- messages from ``mdrun''.
-
- If the raid-5 set is corrupted due to a power loss,
- rather than a disk crash, one can try to recover by
- creating a new RAID superblock:
-
- mkraid -f --only-superblock raid5.conf
-
- A RAID array doesn't provide protection against a power
- failure or a kernel crash, and can't guarantee correct
- recovery. Rebuilding the superblock will simply cause the
- system to ignore the condition by marking all the drives as
- ``OK'', as if nothing happened.
-
- 11.
- Q: How does RAID-5 work when a disk fails?
-
- A: The typical operating scenario is as follows:
-
- ╖ A RAID-5 array is active.
-
- ╖ One drive fails while the array is active.
-
- ╖ The drive firmware and the low-level Linux
- disk/controller drivers detect the failure and report an
- error code to the MD driver.
-
- ╖ The MD driver continues to provide an error-free /dev/md0
- device to the higher levels of the kernel (with a
- performance degradation) by using the remaining
- operational drives.
-
- ╖ The sysadmin can umount /dev/md0 and mdstop /dev/md0 as
- usual.
-
- ╖ If the failed drive is not replaced, the sysadmin can
- still start the array in degraded mode as usual, by
- running mdadd and mdrun.
-
- 12.
- Q: The QuickStart says that mdstop is just to make sure that the
- disks are sync'ed. Is this REALLY necessary? Isn't unmounting the
- file systems enough?
-
- A: The command mdstop /dev/md0 will:
-
- ╖ mark it ''clean''. This allows us to detect unclean
- shutdowns, for example due to a power failure or a kernel
- crash.
-
- ╖ sync the array. This is less important after unmounting a
- filesystem, but is important if the /dev/md0 is accessed
- directly rather than through a filesystem (for example,
- by e2fsck).
-
- 13.
- Q:
-
- A:
-
- 14.
- Q: Why is there no question 13?
-
- A: If you are concerned about RAID, High Availability, and
- UPS, then its probably a good idea to be superstitious as
- well.
-
- 4. Troubleshooting Install Problems
-
- 1. Q: What is the current best known-stable patch for RAID in the
- 2.0.x series kernels?
-
- A: As of 18 Sept 1997, it is "2.0.30 + pre-9 2.0.31 + Werner
- Fink's swapping patch + the alpha RAID patch". As of
- November 1997, it is 2.0.31 + ... !?
-
- 2. Q: The RAID patches will not install cleanly for me. What's wrong?
-
- A: Make sure that /usr/include/linux is a symbolic link to
- /usr/src/linux/include/linux.
-
- Make sure that the new files raid5.c, etc. have been copied
- to their correct locations. Sometimes the patch command
- will not create new files. Try the -f flag on patch.
-
- 3. Q: While compiling raidtools 0.42, compilation stops trying to
- include <pthread.h>, which doesn't exist on my system. How do I
- fix this?
-
- A: raidtools-0.42 requires linuxthreads-0.6 from:
- <ftp://ftp.inria.fr/INRIA/Projects/cristal/Xavier.Leroy>
- Alternately, use glibc v2.0.
-
- 4. Q: I get the message: mdrun -a /dev/md0: Invalid argument
-
- A: Use mkraid to initialize the RAID set prior to the first
- use. mkraid ensures that the RAID array is initially in a
- consistent state by erasing the RAID partitions. In
- addition, mkraid will create the RAID superblocks.
-
- 5. Q: I get the message: mdrun -a /dev/md0: Invalid argument The setup
- was:
-
- ╖ raid1 build as a kernel module
-
- ╖ normal install procedure followed ... mdcreate, mdadd, etc.
-
- ╖ cat /proc/mdstat shows
-
- Personalities :
- read_ahead not set
- md0 : inactive sda1 sdb1 6313482 blocks
- md1 : inactive
- md2 : inactive
- md3 : inactive
-
- ╖ mdrun -a creates the error message /dev/md0: Invalid argument
-
- A: Try lsmod to see if the module is loaded, and if not,
- load it with modprobe raid1.
-
- 6. Q: Truxton Fulton wrote:
-
- On my Linux 2.0.30 system, while doing a mkraid for a RAID-1
- device, during the clearing of the two individual
- partitions, I got "Cannot allocate free page" errors
- appearing on the console, and "Unable to handle kernel
- paging request at virtual address ..." errors in the system
- log. At this time, the system became quite unusable, but it
- appears to recover after a while. The operation appears to
- have completed with no other errors, and I am successfully
- using my RAID-1 device. The errors are disconcerting
- though. Any ideas?
-
- A: This was a well-known bug in the 2.0.30 kernels. It is
- fixed in the 2.0.31 kernel; alternately, fall back to
- 2.0.29.
-
- 7. Q: I'm not able to mdrun a RAID-1, RAID-4 or RAID-5 device. If I
- try to mdrun a mdadd'ed device I get the message ''invalid raid
- superblock magic''.
-
- A: Make sure that you've run the mkraid part of the install
- procedure.
-
- 8. Q: When I access /dev/md0, the kernel spits out a lot of errors
- like md0: device not running, giving up ! and I/O error.... I've
- successfully added my devices to the virtual device.
-
- A: To be usable, the device must be running. Use mdrun -px
- /dev/md0 where x is l for linear, 0 for RAID-0 or 1 for
- RAID-1, etc.
-
- 9. Q: I've created a linear md-dev with 2 devices. cat /proc/mdstat
- shows the total size of the device, but df only shows the size of
- the first physical device.
-
- A: You must mkfs your new md-dev before using it the first
- time, so that the filesystem will cover the whole device.
-
- 10.
- Q: I've set up /etc/mdtab using mdcreate, I've mdadd'ed, mdrun and
- fsck'ed my two /dev/mdX partitions. Everything looks okay before a
- reboot. As soon as I reboot, I get an fsck error on both
- partitions: fsck.ext2: Attempt to read block from filesystem
- resulted in short read while trying to open /dev/md0. Why?! How
- do I fix it?!
-
- A: During the boot process, the RAID partitions must be
- started before they can be fsck'ed. This must be done in
- one of the boot scripts. For some distributions, fsck is
- called from /etc/rc.d/rc.S, for others, it is called from
- /etc/rc.d/rc.sysinit. Change this file to run mdadd -ar
- *before* fsck -A is executed. Better yet, it is suggested
- that ckraid be run if mdadd returns with an error. How to do
- this is discussed in greater detail in question 7 of the
- section ''Setup & Installation Considerations''.
-
- 11.
- Q: I get the message invalid raid superblock magic while trying to
- run an array which consists of partitions which are bigger than
- 4GB.
-
- A: This bug is now fixed. (September 97) Make sure you have
- the latest raid code.
-
- 12.
- Q: I get the message Warning: could not write 8 blocks in inode
- table starting at 2097175 while trying to run mke2fs on a partition
- which is larger than 2GB.
-
- A: This seems to be a problem with mke2fs (November 97). A
- temporary work-around is to get the mke2fs code, and add
- #undef HAVE_LLSEEK to e2fsprogs-1.10/lib/ext2fs/llseek.c
- just before the first #ifdef HAVE_LLSEEK and recompile
- mke2fs.
-
- 13.
- Q: ckraid currently isn't able to read /etc/mdtab
-
- A: The RAID0/linear configuration file format used in
- /etc/mdtab is obsolete, although it will be supported for a
- while more. The current, up-to-date config files are named
- /etc/raid1.conf, /etc/raid5.conf, etc.
-
- 14.
- Q: The personality modules (raid1.o) are not loaded automatically;
- they have to be manually modprobe'd before mdrun. How can this be
- fixed?
-
- A: To autoload the modules, we can add the following to
- /etc/conf.modules:
-
- alias md-personality-3 raid1
- alias md-personality-4 raid5
-
- 15.
- Q: I've mdadd'ed 13 devices, and now I'm trying to mdrun -p5
- /dev/md0 and get the message: /dev/md0: Invalid argument
-
- A: The default configuration for software RAID is 8 real
- devices. Edit linux/md.h, change the line #define MAX_REAL 8
- to a larger number, and rebuild the kernel.
-
- 16.
- Q: I can't make md work with partitions on our latest SPARCstation
- 5. I suspect that this has something to do with disk-labels.
-
- A: Sun disk-labels sit in the first 1K of a partition. For
- RAID-1, the Sun disk-label is not an issue since ext2fs will
- skip the label on every mirror. For other raid levels (0,
- linear and 4/5), this appears to be a problem; it has not
- yet (Dec 97) been addressed.
-
- 5. Supported Hardware
-
- 1. Q: I have SCSI adapter brand XYZ (with or without several
- channels), and disk brand(s) PQR and LMN; will these work with md
- to create a linear/striped/mirrored personality?
-
- A: Yes! Software RAID will work with any disk controller
- (IDE or SCSI) and any disks. The disks do not have to be
- identical, nor do the controllers. For example, a RAID
- mirror can be created with one half of the mirror being a SCSI
- disk, and the other an IDE disk. The disks do not even have
- to be the same size. There are no restrictions on the
- mixing & matching of disks and controllers.
-
- This is because Software RAID works with disk partitions,
- not with the raw disks themselves. The only recommendation
- is that for RAID levels 1 and 5, the disk partitions that
- are used as part of the same set be the same size. If the
- partitions used to make up the RAID 1 or 5 array are not the
- same size, then the excess space in the larger partitions is
- wasted (not used).
-
- 2. Q: I have a twin channel BT-952, and the box states that it
- supports hardware RAID 0, 1 and 0+1. I have made a RAID set with
- two drives; the card apparently recognizes them when it's doing
- its BIOS startup routine. I've been reading in the driver source
- code, but found no reference to the hardware RAID support. Anybody
- out there working on that?
-
- A: The Mylex/BusLogic FlashPoint boards with RAIDPlus are
- actually software RAID, not hardware RAID at all. RAIDPlus
- is only supported on Windows 95 and Windows NT, not on
- NetWare or any of the Unix platforms. Aside from booting and
- configuration, the RAID support is actually in the OS
- drivers.
-
- While in theory Linux support for RAIDPlus is possible, the
- implementation of RAID-0/1/4/5 in the Linux kernel is much
- more flexible and should have superior performance, so
- there's little reason to support RAIDPlus directly.
-
- 6. Modifying an Existing Installation
-
- 1. Q: Are linear MD's expandable? Can a new hard-drive/partition be
- added, and the size of the existing file system expanded?
-
- A: Miguel de Icaza <miguel@luthien.nuclecu.unam.mx> writes:
-
- I changed the ext2fs code to be aware of multiple devices
- instead of the regular one-device-per-file-system
- assumption.
-
- So, when you want to extend a file system, you run a utility
- program that makes the appropriate changes on the new device
- (your extra partition) and then you just tell the system to
- extend the fs using the specified device.
-
- You can extend a file system with new devices at system
- operation time; there is no need to bring the system down
- (and whenever I get some extra time, you will be able to
- remove devices from the ext2 volume set, again without even
- having to go to single-user mode or any hack like that).
-
- You can get the patch for 2.1.x kernel from my web page:
-
- <http://www.nuclecu.unam.mx/~miguel/ext2-volume>
-
- 2. Q: Can I add disks to a RAID-5 array?
-
- A: Currently, (September 1997) no, not without erasing all
- data. A conversion utility to allow this does not yet exist.
- The problem is that the actual structure and layout of a
- RAID-5 array depends on the number of disks in the array.
-
- Of course, one can add drives by backing up the array to
- tape, deleting all data, creating a new array, and restoring
- from tape.
-
- 3. Q: What would happen to my RAID1/RAID0 sets if I shift one of the
- drives from being /dev/hdb to /dev/hdc?
-
- Because of cabling/case size/stupidity issues, I had to make my
- RAID sets on the same IDE controller (/dev/hda and /dev/hdb). Now
- that I've fixed some stuff, I want to move /dev/hdb to /dev/hdc.
-
- What would happen if I just change the /etc/mdtab and
- /etc/raid1.conf files to reflect the new location?
-
- A: For RAID-0/linear, one must be careful to specify the
- drives in exactly the same order. Thus, in the above
- example, if the original config is
-
- mdadd /dev/md0 /dev/hda /dev/hdb
-
- Then the new config *must* be
-
- mdadd /dev/md0 /dev/hda /dev/hdc
-
- For RAID-1/4/5, the drive's ''RAID number'' is stored in its
- RAID superblock, and therefore the order in which the disks
- are specified is not important.
-
- RAID-0/linear does not have a superblock due to its older
- design, and the desire to maintain backwards compatibility
- with this older design.
-
- 7. Performance, Tools & General Bone-headed Questions
-
- 1. Q: I've created a RAID-0 device on /dev/sda2 and /dev/sda3. The
- device is a lot slower than a single partition. Isn't md a pile of
- junk?
-
- A: To have a RAID-0 device running at full speed, you must
- have partitions from different disks. Besides, putting the
- two halves of an array on the same disk fails to give you
- any protection whatsoever against disk failure.
- 2. Q: I have 2 Brand X super-duper hard disks and a Brand Y
- controller, and am considering using md. Does it significantly
- increase the throughput? Is the performance really noticeable?
-
- A: The answer depends on the configuration that you use.
-
- Linux MD RAID-0 (striping) performance:
- Must wait for all disks to read/write the stripe.
-
- Linux MD RAID-1 (mirroring) read performance:
- MD implements read balancing. In a low-IO situation,
- this won't change performance. But, with two disks in
- a high-IO environment, this could as much as double
- the read performance. For N disks in the mirror, this
- could improve performance N-fold.
-
- Linux MD RAID-1 (mirroring) write performance:
- Must wait for the write to occur to all of the disks
- in the mirror.
-
- 3. Q: What is the optimal RAID-5 configuration for performance?
-
- A: Since RAID-5 attempts to equally distribute the I/O load
- across several drives, the best performance will be obtained
- when the RAID set is balanced by using identical drives,
- identical controllers, and the same (low) number of drives
- on each controller.
-
- Note, however, that using identical components might raise
- the probability of multiple drive failures.
-
- 4. Q: What is the optimal block size for a RAID-4/5 array?
-
- A: When using the current (November 1997) RAID-4/5
- implementation, it is strongly recommended that the file
- system be created with mke2fs -b 4096 instead of the default
- 1024-byte filesystem block size.
-
- This is because the current RAID-5 implementation allocates
- one 4K memory page per disk block; thus 75% of the memory
- which RAID-5 is allocating for pending I/O is not being
- used. With a 4096 block size, it will potentially queue 4
- times as much pending I/O to the low level drivers without
- allocating additional memory.
-
- Note: the 4K memory page size applies to the Intel x86
- architecture. I think memory pages are 8K on Alpha/Sparc
- (????), and thus the above figures should be adjusted
- accordingly.
-
- Note: if your file system has a lot of small files (files
- less than 10KBytes in size), a considerable fraction of the
- disk space might be wasted. This is because disk space is
- allocated in multiples of the block size. Allocating large
- blocks for small files clearly results in a waste of disk
- space.
-
- Note: the above remarks do NOT apply to Software
- RAID-0/1/linear.
-
- Note: most ''typical'' systems do not have that many small
- files. That is, although there might be thousands of small
- files, this would lead to only some 10 to 100MB of wasted
- space, which is probably an acceptable tradeoff for
- performance on a multi-gigabyte disk.
-
- Note: for news servers, there might be tens or hundreds of
- thousands of small files. In such cases, the smaller block
- size may be more important than the improved performance.
-
- Note: there exists an experimental file system for Linux
- which packs small files and file chunks onto a single block.
- It apparently has some very positive performance
- implications when the average file size is much smaller than
- the block size.
-
- Note: future versions may implement schemes that obsolete
- the above discussion. However, this is difficult to
- implement, since dynamic run-time allocation can lead to
- deadlocks; the current implementation performs a static
- pre-allocation.
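-
- For example, to create the file system with 4096-byte blocks on
- the array (the -R stride option discussed in the ``Questions
- Waiting for Answers'' section below can be added as well):
-
- mke2fs -b 4096 /dev/md0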
-
- 5. Q: How does the chunk size influence the speed of my RAID device?
-
- A: The chunk size is the amount of data that is contiguous
- on the virtual device and is also contiguous on the physical
- device. Depending on your workload, it is best to let the
- chunk size match the size of your typical request, so that
- two requests have a chance of landing on different disks and
- being run at the same time. Finding the chunk size that
- matches the average request size, and therefore gives the
- best performance, requires a fair amount of testing.
-
- 6. Q: Where can I put the md commands in the startup scripts, so that
- everything will start automatically at boot time?
-
- A: Rod Wilkens <rwilkens@border.net> writes:
-
- What I did is put ``mdadd -ar'' in the
- ``/etc/rc.d/rc.sysinit'' right after the kernel loads the
- modules, and before the ``fsck'' disk check. This way, you
- can put the ``/dev/md?'' device in the ``/etc/fstab''. Then
- I put the ``mdstop -a'' right after the ``umount -a''
- unmounting the disks, in the ``/etc/rc.d/init.d/halt'' file.
-
- For raid-5, you will want to look at the return code for mdadd, and if
- it failed, do a
-
- ckraid --fix /etc/raid5.conf
-
- to repair any damage.
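-
- Putting the pieces together, the additions might look roughly
- like this (paths and config file names vary between
- distributions; treat this as a sketch, not as tested boot code):
-
- # in /etc/rc.d/rc.sysinit (or rc.S), after the modules are loaded
- # and before the ``fsck -A'' that checks the /etc/fstab file systems:
- mdadd -ar || {
-     ckraid --fix /etc/raid5.conf     # use the config file for your array
-     mdadd -ar
- }
-
- # in /etc/rc.d/init.d/halt, right after ``umount -a'':
- mdstop -a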
-
- 7. Q: I was wondering if it's possible to set up striping with more
- than 2 devices in md0? This is for a news server, and I have 9
- drives... Needless to say, I need much more than two. Is this
- possible?
-
- A: Yes. (describe how to do this)
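-
- A hypothetical sketch for nine drives, using the same commands as
- for two (the device names are assumptions, and the /etc/mdtab
- entry created with mdcreate is omitted here; only the number of
- member partitions changes). Note that, as mentioned in the
- Troubleshooting section, the default MAX_REAL limit is 8 devices,
- so a nine-drive set needs that constant raised first:
-
- mdadd /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 \
-       /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1
- mdrun -p0 /dev/md0       # -p0 selects the striping (RAID-0) personality
- mke2fs /dev/md0          # then create a file system covering the whole set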
-
- 8. Q: When is Software RAID superior to Hardware RAID?
-
- A: Normally, Hardware RAID is considered superior to
- Software RAID, because hardware controllers often have a
- large cache, and can do a better job of scheduling
- operations in parallel. However, integrated Software RAID
- can (and does) gain certain advantages from being close to
- the operating system.
-
- For example, ... ummm. Opaque description of caching of
- reconstructed blocks in buffer cache elided ...
-
- On a dual PPro SMP system, it has been reported that
- Software-RAID performance exceeds the performance of a
- well-known hardware-RAID board vendor by a factor of 2 to 5.
-
- Software RAID is also a very interesting option for
- high-availability redundant server systems. In such a
- configuration, two CPUs are attached to one set of SCSI
- disks. If one server crashes or fails to respond, then the
- other server can mdadd, mdrun and mount the software RAID
- array, and take over operations. This sort of dual-ended
- operation is not always possible with many hardware RAID
- controllers, because of the state configuration that the
- hardware controllers maintain.
-
- 9. Q: If I upgrade my version of raidtools, will it have trouble
- manipulating older raid arrays? In short, should I recreate my
- RAID arrays when upgrading the raid utilities?
-
- A: No, not unless the major version number changes. An MD
- version x.y.z consists of three sub-versions:
-
- x: Major version.
- y: Minor version.
- z: Patchlevel version.
-
- Version x1.y1.z1 of the RAID driver supports a RAID array
- with version x2.y2.z2 in case (x1 == x2) and (y1 >= y2).
-
- Different patchlevel (z) versions for the same (x.y) version
- are designed to be mostly compatible.
-
- The minor version number is increased whenever the RAID
- array layout is changed in a way which is incompatible with
- older versions of the driver. New versions of the driver
- will maintain compatibility with older RAID arrays.
-
- The major version number will be increased if it will no
- longer make sense to support old RAID arrays in the new
- kernel code.
-
- For RAID-1, it's not likely that either the disk layout or
- the superblock structure will change anytime soon. Most
- optimizations and new features (reconstruction, multithreaded
- tools, hot-plug, etc.) don't affect the physical layout.
-
- 10.
- Q: The command mdstop /dev/md0 says that the device is busy.
-
- A: There's a process that has a file open on /dev/md0, or
- /dev/md0 is still mounted. Terminate the process or umount
- /dev/md0.
-
- 11.
- Q: Are there performance tools?
-
- A: There is a new utility called iotrace in the
- linux/iotrace directory. It reads /proc/io-trace and
- analyses/plots its output. If you feel your system's block
- IO performance is too low, just look at the iotrace output.
-
- 12.
- Q: I was reading the RAID source, and saw the value SPEED_LIMIT
- defined as 1024K/sec. What does this mean? Does this limit
- performance?
-
- A: SPEED_LIMIT is used to limit RAID reconstruction speed
- during automatic reconstruction. Basically, automatic
- reconstruction allows you to e2fsck and mount immediately
- after an unclean shutdown, without first running ckraid.
- Automatic reconstruction is also used after a failed hard
- drive has been replaced.
-
- In order to avoid overwhelming the system while
- reconstruction is occurring, the reconstruction thread
- monitors the reconstruction speed and slows it down if it's
- too fast. The 1M/sec limit was arbitrarily chosen as a
- reasonable rate which allows the reconstruction to finish
- reasonably rapidly, while creating only a light load on the
- system so that other processes are not interfered with.
-
- 13.
- Q: What about ''spindle synchronization'' or ''disk
- synchronization''?
-
- A: Spindle synchronization is used to keep multiple hard
- drives spinning at exactly the same speed, so that their
- disk platters are always perfectly aligned. This is used by
- some hardware controllers to better organize disk writes.
- However, for software RAID, this information is not used,
- and spindle synchronization might even hurt performance.
-
- 8. Questions Waiting for Answers
-
- 1. Q: What are the options you have used for formatting the (raid)
- disks? I used:
-
- mke2fs -b 4096 -R stride=4 ... blah
-
- or is it supposed to be 64K x 4 drives:
-
- mke2fs -b 4096 -R stride=262000 ... blah
-
- are there any other options?
-
- stride blocks are filesystem blocks, not virtual memory
- pages.
-
- Is there a paper somewhere about what the stride option
- does? Also, how does it relate to the ``md'' device driver?
-
- The stride option serves only one purpose: it tells mke2fs
- how many file system blocks will be written to each member
- in turn. This allows mke2fs to allocate the block and inode
- bitmaps so that they don't all end up on the same physical
- drive. I noticed last spring that one drive in a pair
- always had a larger I/O count, and tracked it down to
- these meta-data blocks. Ted added the -R stride= option in
- response to my explanation and request for a workaround.
-
- > For a 4KB block file system, with stripe size 32KB, one
- > would use -R stride=8.
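-
- In other words, the stride is the chunk (stripe) size divided by
- the file system block size. A worked example under that
- assumption:
-
- # chunk size 32KB, ext2 block size 4KB  =>  stride = 32 / 4 = 8
- mke2fs -b 4096 -R stride=8 /dev/md0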
-
- 2. Q: For testing the raw disk throughput... is there a character
- device for raw reads/raw writes instead of /dev/sdaxx that we can
- use to measure performance on the raid drives? Is there a
- GUI-based tool to use to watch the disk throughput?
-
- 9. Wish List of Enhancements to MD and Related Software
-
- Bradley Ward Allen <ulmo@Q.Net> wrote:
-
- Ideas include:
-
- ╖ Bootup parameters to tell the kernel which devices are to
- be MD devices (no more ``mdadd'')
-
- ╖ Making MD transparent to ``mount''/``umount'' such that
- there is no ``mdrun'' and ``mdstop''
-
- ╖ Integrating ``ckraid'' entirely into the kernel, and
- letting it run as needed
-
- (So far, all I've done is suggest getting rid of the
- tools and putting them into the kernel; that's how I feel
- about it, this is a filesystem, not a toy.)
-
- ╖ Deal with arrays that can easily survive N disks going
- out simultaneously or at separate moments, where N is a
- whole number > 0 settable by the administrator
-
- ╖ Handle kernel freezes, power outages, and other abrupt
- shutdowns better
-
- ╖ Don't disable a whole disk if only parts of it have
- failed, e.g., if the sector errors are confined to less
- than 50% of access over the attempts of 20 dissimilar
- requests, then it continues just ignoring those sectors
- of that particular disk.
-
- ╖ Bad sectors:
-
- ╖ A mechanism for saving which sectors are bad, someplace
- onto the disk.
-
- ╖ If there is a generalized mechanism for marking degraded
- bad blocks that upper filesystem levels can recognize,
- use that. Program it if not.
-
- ╖ Perhaps alternatively a mechanism for telling the upper
- layer that the size of the disk got smaller, even
- arranging for the upper layer to move out stuff from the
- areas being eliminated. This would help with degraded
- blocks as well.
-
- ╖ Failing the above ideas, keeping a small (admin settable)
- amount of space aside for bad blocks (distributed evenly
- across disk?), and using them (nearby if possible)
- instead of the bad blocks when it does happen. Of
- course, this is inefficient. Furthermore, the kernel
- ought to log every time the RAID array starts each bad
- sector and what is being done about it with a ``crit''
- level warning, just to get the administrator to realize
- that his disk has a piece of dust burrowing into it (or a
- head with platter sickness).
-
- ╖ Software-switchable disks:
-
- ``disable this disk''
- would block until kernel has completed making sure
- there is no data on the disk being shut down that is
- needed (e.g., to complete an XOR/ECC/other error
- correction), then release the disk from use (so it
- could be removed, etc.);
-
- ``enable this disk''
- would mkraid a new disk if appropriate and then start
- using it for ECC/whatever operations, enlarging the
- RAID5 array as it goes;
-
- ``resize array''
- would respecify the total number of disks and the
- number of redundant disks, and the result would often
- be to resize the size of the array; where no data loss
- would result, doing this as needed would be nice, but
- I have a hard time figuring out how it would do that;
- in any case, a mode where it would block (for possibly
- hours (kernel ought to log something every ten seconds
- if so)) would be necessary;
-
- ``enable this disk while saving data''
- which would save the data on a disk as-is and move it
- to the RAID5 system as needed, so that a horrific save
- and restore would not have to happen every time
- someone brings up a RAID5 system (instead, it may be
- simpler to only save one partition instead of two, it
- might fit onto the first as a gzip'd file even);
- finally,
-
- ``re-enable disk''
- would be an operator's hint to the OS to try out a
- previously failed disk (it would simply call disable
- then enable, I suppose).
-
-